22  Logistic Regression

Author

Vladimir Buskin

22.2 Preparation

Consider the data from Buskin’s (n.d.)1 corpus-study on subject pronoun realisation:

  • 1 The input data is can be downloaded from this OSF repository: https://osf.io/qgnms.

  • # Load libraries
    library(tidyverse)
    library(rms)
    library(broom)
    library(sjPlot)
    library(ggeffects)
    library(ggpubr)
    
    # Load data
    data_pro <- read.csv("INPUT_pronouns.csv", sep = ",", header = TRUE)
    
    # Inspect data
    str(data_pro)
    'data.frame':   4838 obs. of  5 variables:
     $ Reference     : chr  "overt" "overt" "overt" "overt" ...
     $ Person        : chr  "3" "3" "3" "3" ...
     $ Register      : chr  "S1A" "S1A" "S1A" "S1A" ...
     $ Variety       : chr  "GB" "GB" "GB" "GB" ...
     $ Referentiality: chr  "referential" "referential" "referential" "referential" ...
    head(data_pro)
      Reference Person Register Variety Referentiality
    1     overt      3      S1A      GB    referential
    2     overt      3      S1A      GB    referential
    3     overt      3      S1A      GB    referential
    4     overt      3      S1A      GB    referential
    5     overt      3      S1A      GB    referential
    6     overt      3      S1A      GB    referential
    • Target variable:

      • Reference (‘overt’, ‘null’)
    • Explanatory variables:

      • Person (‘1.p.’, ‘2.p’, ‘3.p’ as well as the dummy pronouns ‘it’ and ‘there’)

      • Register (the text category in the International Corpus of English; ‘S1A’ are informal conversations, whereas ‘S1B’ comprises formal class lessons)

      • Variety (British English ‘GB’, Singapore English ‘SING’ and Hong Kong English ‘HK’) and

      • Referentiality (‘referential’ with an identifiable referent or ‘non-referential’ with no/generic reference)

    head(data_pro)
      Reference Person Register Variety Referentiality
    1     overt      3      S1A      GB    referential
    2     overt      3      S1A      GB    referential
    3     overt      3      S1A      GB    referential
    4     overt      3      S1A      GB    referential
    5     overt      3      S1A      GB    referential
    6     overt      3      S1A      GB    referential
    table(data_pro$Reference)
    
     null overt 
      174  4664 

    22.2.1 Descriptive overview

    22.3 Logistic regression

    In contrast to linear regression, logistic regression models a qualitative response variable \(Y\) with two outcomes2. In the present study, \(Y\) is pronominal Reference and has the outcomes Reference = null and Reference = overt, which represent null and overt subjects, respectively. Dichotomous variables of this kind are also often coded as yes/no or 1/0.

  • 2 Logistic regression can also be used for \(\geq 3\) classes by breaking down the response variable into a series of dichotomous variables. This is also known as multinomial logistic regression or softmax regression.

  • Another difference from linear regression is the output of the model. In linear regression, we obtain a predicted value for the continuous response variable we’re interested in. For instance, if we’re modelling reaction times, the model will return an estimated reaction time (given the predictors).

    In logistic regression, however, we either get a class label or a probability. When modelling pronominal reference, the model will thus either tell us

    1. whether a speaker would use an overt or a null subject in a given observation (class prediction).

    2. what the probability of using one variant vs. the other would be (probability prediction).

    A core component of logistic regression is the logistic function. The rationale for using it is that the output of the function will always lie between \(0\) and \(1\), and it will always denote a probability.

    22.3.1 The simple logistic model

    Assuming a binary response variable \(Y\) with the values 1 and 0 and a single predictor \(X\), we can model the probability \(P(Y = 1 \mid X)\) as

    \[ P(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}. \]

    The above expression is equivalent to

    \[ \log\left(\frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)}\right) = \beta_0 + \beta_1X. \]

    The fraction \(\frac{P(Y = 1 \mid X)}{1-P(Y = 1 \mid X)}\) represents the odds, which stand for to the probability of one outcome (e.g., Reference = null) compared to the other (e.g., Reference = overt).

    Their logarithmic transformation are the log odds (or logits) of one outcome versus the other.

    Understanding log odds

    When interpreting the output of a logistic model, note that

    • positive log odds indicate an increase in \(\frac{P(Y = 1 \mid X)}{1-P(Y = 1 \mid X)}\), whereas

    • negative log odds indicate a decrease in \(\frac{P(Y = 1 \mid X)}{1-P(Y = 1 \mid X)}\).

    In more concrete terms: If \(X = \text{Register}\), then our model would have the form:

    \[ \log\left(\frac{P(\text{Reference} = \text{null} \mid \text{Register})}{1- P(\text{Reference} = \text{null} \mid \text{Register})}\right) = \beta_0 + \beta_1\text{Register} \]

    Given Register, the log odds of a null subject versus an overt one is essentially equivalent to the linear model equation. The model parameters \(\beta_0\) and \(\beta_1\) are typically estimated using Maximum Likelihood Estimation3 (MLE).

  • 3 The goal of the fitting procedure is to maximise the likelihood function of the parameter matrix \(\ell(\mathbf{\beta})\). Via partial differentiation, an intermediary result can be attained which can only be solved using an iterative algorithm (e.g. Newton-Ralphson). For further technical details, see Wood (2006: 63-66) or Agresti & Kateri (2022: 291-294).

  • 22.3.2 Multiple logistic regression

    If more than one predictor is included, the above equations can be expanded so as to take into account \(1, ..., p\) slopes and \(1, ..., p\) independent variables \(X_p\).

    \[ P(Y = 1 \mid X_1, ..., X_p) = \frac{e^{\beta_0 + \beta_1X_1 + ... + \beta_pX_p}}{1 + e^{\beta_0 + \beta_1X_1 + ... + \beta_pX_p}}. \]

    Thus, the log odds correspond to the sum of \(\beta_pX_p\),

    \[ \begin{aligned} \log\left(\frac{P(Y = 1 \mid X_1, ..., X_p)}{1 - P(Y = 1 \mid X_1, ..., X_p)}\right) & = \beta_0 + \beta_1X_1 + ... + \beta_pX_p \\ & = \beta_0 + \sum_{i=1}^{p} \beta_i X_i \end{aligned} \] respectively.

    22.3.3 Odds ratios

    To assess the strength of an effect, it is instructive to examine the odds ratios that correspond to the model coefficients. Odds ratios (OR) are defined as

    \[ OR(X_1) = e^{\beta_1}. \]

    Essentially, the OR describes the ratio between two odds with respect to another independent variable. This is illustrated for Reference given Register below:

    \[ \text{OR}(\text{Reference} \mid \text{Register}) = \frac{\frac{P(\text{Reference} = \text{null} \mid \text{Register} = \text{S1A})}{P(\text{Reference} = \text{overt} \mid \text{Register} = \text{S1A})}}{\frac{P(\text{Reference} = \text{null} \mid \text{Register} = \text{S1B})}{P(\text{Reference} = \text{overt} \mid \text{Register} = \text{S1B})}} \]

    Read as: ‘The ratio between the probability of a null vs. overt object in S1A and the probability of a null vs. overt object in S1B’.

    22.4 Workflow in R

    22.4.1 Step 1: Research question and hypotheses

    How do the intra- and extra-linguistic variables suggested in the literature affect subject pronoun realisation (Definite Null Instantiation) in British English, Singapore English and Hong Kong English?

    Given a significance level \(\alpha = 0.05\), the hypotheses are: \[ \begin{aligned} H_0: & \quad \text{None of the predictor coefficients deviate from 0}.\\ H_1: & \quad \text{At least one predictor coefficient deviates from 0}. \end{aligned} \]

    These can be restated mathematically as:

    \[ \begin{aligned} H_0: & \quad \beta_1 = \beta_2 = \cdots = \beta_p = 0 \\ H_1: & \quad \text{At least one } \beta_i \neq 0 \text{ for } i \in \{1, 2, \ldots, p\} \end{aligned} \]

    22.4.2 Step 2: Convert to factors and specify reference levels

    The next step involves specifying reference levels for all categorical variables. This step is very important because it will directly impact the parameter estimation and, consequently, influence our interpretation of the model output.

    • The reference level of the response is usually chosen such that it corresponds to the unmarked or most frequent case. Since overt pronouns are much more common in the data, the reference level of the Reference variable will be set to Reference = overt. This way, the model coefficients will directly represent the probability of the null subject variant (i.e., the special case) given certain predictor configurations.

    • The predictor levels need to be specified as well. Among other things, we are interested in how the Asian Englishes pattern relative to British English. Therefore, we will define British English as the baseline for comparison.

    We will use the following specifications:

    Variable Factor Levels Preferred Reference level
    Register S1A, S1B S1A
    Variety GB, SING, HK GB
    Person 1, 2, 3, it, there 3
    Referentiality referential, non-referential referential
    # Store "Reference" as factor
    data_pro$Reference <- as.factor(data_pro$Reference)
    
    ## Specify reference level (the 'unmarked' case)
    data_pro$Reference <- relevel(data_pro$Reference, "overt")
    
    ## Print levels
    levels(data_pro$Reference)
    [1] "overt" "null" 

    Repeat the procedure for the remaining categorical variables.

    Code
    # Store "Register" as factor
    data_pro$Register <- as.factor(data_pro$Register)
    
    ## Specify reference level
    data_pro$Register <- relevel(data_pro$Register, "S1A")
    
    # Store "Variety" as factor
    data_pro$Variety <- as.factor(data_pro$Variety)
    
    ## Specify reference level
    data_pro$Variety <- relevel(data_pro$Variety, "GB")
    
    # Store "Person" as factor
    data_pro$Person <- as.factor(data_pro$Person)
    
    ## Specify reference level
    data_pro$Person <- relevel(data_pro$Person, "3")
    
    # Store "Referentiality" as factor
    data_pro$Referentiality <- as.factor(data_pro$Referentiality)
    
    ## Specify reference level
    data_pro$Referentiality <- relevel(data_pro$Referentiality, "referential")

    22.4.3 Step 3: Fit the model

    There are two functions that can fit logistic models in R: lrm() and glm().

    Note

    The model formula below does not include Referentiality because several intermediary steps revealed it to be almost completely irrelevant for predicting Reference. In addition, the existing (and significant) interaction Variety:Person has been excluded to improve the interpretability of the model.

    # With lrm(); requires library("rms")
    
    # Fit interaction model
    Reference.lrm <- lrm(Reference ~ Register + Variety + Register:Variety + Person, data = data_pro)
    
    # View model statistics
    Reference.lrm
    Logistic Regression Model
    
    lrm(formula = Reference ~ Register + Variety + Register:Variety + 
        Person, data = data_pro)
    
                           Model Likelihood      Discrimination    Rank Discrim.    
                                 Ratio Test             Indexes          Indexes    
    Obs          4838    LR chi2     120.43      R2       0.092    C       0.729    
     overt       4664    d.f.             9     R2(9,4838)0.023    Dxy     0.458    
     null         174    Pr(> chi2) <0.0001    R2(9,503.2)0.199    gamma   0.488    
    max |deriv| 4e-10                            Brier    0.034    tau-a   0.032    
    
                                Coef    S.E.   Wald Z Pr(>|Z|)
    Intercept                   -3.4132 0.2746 -12.43 <0.0001 
    Register=S1B                 0.0269 0.3807   0.07 0.9437  
    Variety=HK                   0.6712 0.3174   2.11 0.0345  
    Variety=SING                 1.1193 0.2959   3.78 0.0002  
    Person=1                    -0.8807 0.1811  -4.86 <0.0001 
    Person=2                    -1.6441 0.2695  -6.10 <0.0001 
    Person=it                    0.7897 0.2978   2.65 0.0080  
    Person=there                -2.5641 1.0095  -2.54 0.0111  
    Register=S1B * Variety=HK    0.6035 0.4521   1.34 0.1819  
    Register=S1B * Variety=SING -0.4753 0.4688  -1.01 0.3107  
    # With (glm); available in base R
    # Note the additional "family" argument!
    Reference.glm <- glm(Reference ~ Register + Variety + Register:Variety + Person, data = data_pro, family = "binomial")
    
    # View model statistics
    summary(Reference.glm)
    tab_model(Reference.glm, show.se = TRUE, show.aic = TRUE, show.dev = TRUE, show.fstat = TRUE, transform = NULL)
      Reference
    Predictors Log-Odds std. Error CI p
    (Intercept) -3.41 0.27 -3.99 – -2.91 <0.001
    Register [S1B] 0.03 0.38 -0.74 – 0.77 0.944
    Variety [HK] 0.67 0.32 0.06 – 1.32 0.034
    Variety [SING] 1.12 0.30 0.56 – 1.73 <0.001
    Person [1] -0.88 0.18 -1.24 – -0.53 <0.001
    Person [2] -1.64 0.27 -2.21 – -1.14 <0.001
    Person [it] 0.79 0.30 0.18 – 1.35 0.008
    Person [there] -2.56 1.01 -5.44 – -1.05 0.011
    Register [S1B] × Variety
    [HK]
    0.60 0.45 -0.28 – 1.50 0.182
    Register [S1B] × Variety
    [SING]
    -0.48 0.47 -1.40 – 0.45 0.311
    Observations 4838
    R2 Tjur 0.030
    Deviance 1378.406
    AIC 1398.406
    Stepwise variable selection

    With the function drop1(), it is possible to successively remove variables from the complex model to ascertain which ones improve the model significantly (i.e., decrease the deviance and AIC scores).

    drop1(Reference.glm, test = "Chisq")
    Single term deletions
    
    Model:
    Reference ~ Register + Variety + Register:Variety + Person
                     Df Deviance    AIC    LRT Pr(>Chi)    
    <none>                1378.4 1398.4                    
    Person            4   1460.2 1472.2 81.828  < 2e-16 ***
    Register:Variety  2   1387.5 1403.5  9.100  0.01057 *  
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    22.4.4 Step 4: Confidence intervals and odds ratios

    # Tidy the model output
    tidy_model <- tidy(Reference.glm, conf.int = TRUE)
    
    # Remove intercept, compute odds ratios and their CIs
    tidy_model <- tidy_model %>% 
      filter(term != "(Intercept)") %>% 
      mutate(
        odds_ratio = exp(estimate),
        odds.conf.low = exp(conf.low),
        odds.conf.high = exp(conf.high)
      )

    22.4.5 Step 5: Visualise the model

    • Plot model coefficients:
    Code
    # Create the coefficient plot
    ggplot(tidy_model, aes(x = estimate, y = term)) +
      geom_point() +
      geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
      geom_vline(xintercept = 0, linetype = "dashed", color = "steelblue") +
      theme_minimal() +
      labs(
        x = "Coefficient Estimate (log-odds)",
        y = "Predictor",
        title = "Coefficient Estimates with Confidence Intervals",
        caption = "*Note that the CIs of singificant predictors do not include 0."
      )

    Code
    # Plot odds ratios
    ggplot(tidy_model, aes(x = exp(estimate), y = term)) +
      geom_point() +
      geom_errorbarh(aes(xmin = odds.conf.low, xmax = odds.conf.high), height = 0.2) +
      geom_vline(xintercept = 1, linetype = "dashed", color = "steelblue") +
      theme_minimal() +
      labs(
        x = "Coefficient Estimate (odds ratios)",
        y = "Predictor",
        title = "Odds ratios with Confidence Intervals",
        caption = "*Note that the CIs of singificant predictors do not include 1."
      )

    • Plot predicted probabilities:
    Code
    # Use ggeffect() from the ggeffects package
    plot(ggeffect(Reference.glm, terms = c("Register"))) + geom_line(col = "steelblue")

    Code
    plot(ggeffect(Reference.glm, terms = c("Variety"))) + geom_line(col = "steelblue")

    Code
    plot(ggeffect(Reference.glm, terms = c("Person"))) + geom_line(col = "steelblue")

    22.4.6 Step 6: Interpret the model

    The logistic regression model is statistically significant at \(p < 0.001\) (\(\chi^2 = 120.43\), \(df = 9\)) and has acceptable fit (Nagelkerke’s-\(R^2\) = \(0.09\), \(C = 0.73\)).

    The model coefficients indicate that null subjects are significantly more likely in Singapore English compared to British English (Estimate = 1.12, 95% CI [0.56, 1.73], \(p < 0.001\)). This effect is moderate with an \(OR\) of 3.06 (95% CI [1.75, 5.64]), suggesting that the probability of subject omission is elevated by a factor of approximately 3 in the Singaporean variety.

    22.4.7 Step 7: Further model diagnostics

    • Cross-validation
    Code
    # Set seed for reproducibility
    set.seed(123)
    
    # Refit the model with additional settings
    Reference.val <- lrm(Reference ~ Register + Variety + Register:Variety + Person, data = data_pro, x = T, y = T)
    
    # Perform 200-fold cross-validation
    model.validated <- validate(Reference.val, B = 200)
    
    # Slope optimism should be as low possible!
    model.validated
              index.orig training    test optimism index.corrected   n
    Dxy           0.4592   0.4655  0.4456   0.0200          0.4393 200
    R2            0.0923   0.0981  0.0843   0.0138          0.0785 200
    Intercept     0.0000   0.0000 -0.2177   0.2177         -0.2177 200
    Slope         1.0000   1.0000  0.9262   0.0738          0.9262 200
    Emax          0.0000   0.0000  0.0622   0.0622          0.0622 200
    D             0.0247   0.0263  0.0225   0.0038          0.0208 200
    U            -0.0004  -0.0004  0.0002  -0.0006          0.0002 200
    Q             0.0251   0.0268  0.0223   0.0045          0.0206 200
    B             0.0336   0.0336  0.0337  -0.0001          0.0337 200
    g             1.0081   1.2487  1.1361   0.1127          0.8954 200
    gp            0.0319   0.0326  0.0303   0.0022          0.0297 200
    • Multicollinearity
    Code
    # Variable inflation factors further reveal severe multicollinearity
    vif(Reference.lrm)
                   Register=S1B                  Variety=HK 
                       5.818111                    4.006084 
                   Variety=SING                    Person=1 
                       3.421687                    1.140407 
                       Person=2                   Person=it 
                       1.089502                    1.102908 
                   Person=there   Register=S1B * Variety=HK 
                       1.007148                    6.218803 
    Register=S1B * Variety=SING 
                       3.685075 
    Agresti, Alan, and Maria Kateri. 2022. Foundations of Statistics for Data Scientists: With r and Python. Boca Raton: CRC Press.
    Buskin, Vladimir. n.d. “Definite Null Instantiation in English(es): A Usage-based Construction Grammar Approach.” Constructions and Frames.
    Hosmer, David W., and Stanley Lemeshow. 2008. Applied Logistic Regression. 2nd ed. New York: Wiley.
    James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in r. New York: Springer. https://doi.org/10.1007/978-1-0716-1418-1.
    Levshina, Natalia. 2015. How to Do Linguistics with r: Data Exploration and Statistical Analysis. Amsterdam; Philadelphia: John Benjamins Publishing Company.
    Winter, Bodo. 2020. Statistics for Linguists: An Introduction Using r. New York; London: Routledge.
    Wood, Simon N. 2006. Generalized Additive Models: An Introduction with R. Boca Raton: Chapman & Hall/CRC.